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Abstract 



We propose a second-order accurate method to estimate the eigenvectors of extremely large matrices 
thereby addressing a problem of relevance to statisticians working in the analysis of very large datasets. 



More specifically, we show that averaging eigenvectors of randomly subsampled matrices efficiently 
approximates the true eigenvectors of the original matrix under certain conditions on the incoherence 
■ of the spectral decomposition. This incoherence assumption is typically milder than those made in 

matrix completion and allows eigenvectors to be sparse. We discuss applications to spectral methods in 
dimensionality reduction and information retrieval. 

<N ■ 1 Introduction 

> 

£T) , Spectral methods have a long list of applications in statistics and machine learning. Beyond dimensionality 

reduction techniques such as PC A or CCA [And03, MKB79], they have been used in clustering [NJW02], 
ranking & information retrieval [PBMW98,HTF + 01,LM05] or classification for example. Computationally, 
one of the most attractive features of these methods is their low numerical cost, in particular on problems 
qs^ ' where the data matrix is sparse (e.g. graph clustering or information retrieval). Computing a few leading 

eigenvalues and eigenvectors of a matrix, using the power or Lanczos methods for example, requires per- 
forming a sequence of matrix vector products and can be processed very efficiently. This means that when 
the matrix is dense and has dimension n, the cost of each iteration is 0(n 2 ) in both storage and flops. 
j_] ■ However, for extremely large scale problems arising in statistics or information retrieval for example, 



this cost quickly becomes prohibitively high and makes spectral methods impractical. In this paper, we 
propose a randomized, distributed algorithm to estimate eigenvectors (and eigenvalues) which makes spec- 
tral methods tractable on very large scale matrices. We show that our method is second order accurate and 
illustrate its performance on a few realistic datasets. 

Going back to the numerical cost of spectral methods, we see that decomposing each matrix vector 
product in many smaller block operations partially alleviates the complexity problem, but makes the over- 
all process very bandwidth intensive. Decomposition techniques thus improve the granularity of iterative 
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eigenvalue methods (i.e. require many cheaper operations instead of a single very expensive one), but at 
the expense of significantly higher bandwidth requirements. Here, we focus on methods that improve the 
granularity of large-scale eigenvalue computations while having very low bandwidth requirements, meaning 
that they can be fully distributed over many loosely connected machines. 

The idea of using subsampling to lower the complexity of spectral methods can be traced back at least to 
[GMKG91,PRTV00] who described algorithms based on subsampling and random projections respectively. 
Explicit error estimates followed in [FKV04, DKM06, AM07] which bounded the approximation error of 
either elementwise or columnwise matrix subsampling procedures. On the application side, a lot of work 
has been focused on the Pagerank vector, and [NZJ01] in particular study its stability under perturbations of 
the network matrix. Similar techniques are applied to spectral clustering in [HYJT08] and both works have 
close connections to ours. Following the Netflix competition on collaborative filtering, a more recent stream 
of works [RFP07,CR08,CT09, KMO09] has also been focused on exactly reconstructing a low rank matrix 
from a small, single incoherent set of observations. Finally, more recent "volume sampling" results provide 
relative error bounds [KV09], but so far, the sampling probabilities required to obtain these improved error 
bounds remain combinatorially hard to compute. 

Our work here is focused on the impact of subsampling on eigenvector approximations. First we seek 
to understand how far we can reduce the granularity of eigenvalue methods using subsampling, before re- 
constructing eigenvectors becomes impossible. This question was partially answered in [CT09, KMO09] 
for matrices with low rank, incoherent spectrum, using a single subset of matrix coefficients, after solving a 
convex program with high complexity. Here we make much milder assumptions on matrix incoherence. In 
particular, we allow some eigenvectors to be sparse (while remaining incoherent on their support) and we 
approximate eigenvectors using many simple operations on subsampled matrices. Under certain conditions 
on the sampling rate which guarantee that we remain in a perturbative setting, we show that simply aver- 
aging many approximate eigenvectors obtained by subsampling reduces approximation error by an order of 
magnitude. 



Notation. In what follows, we write S n the set of symmetric matrices of dimension n. For a matrix 
X G R mx ™, we write ||^||_f its Frobenius norm, \\X\\2 its spectral norm, fJj(X) its i-th largest singular 
value and let H-X^cc = maxy \Xij\, while Card(X) is the number of nonzero coefficients in X. We denote 
by X(i,j) or Xy its (i, j)-th element and by Mj the i-th column of M. Here, o denotes the Hadamard (i.e 
entry wise) product of matrices. When x G R n is a vector, we write its Euclidean norm ||x||2 and ||x||oo its 
^oo norm. We write 1 G R™ the vector having all entries equal to 1. Finally, k denotes a generic constant, 
whose value may change from display to display. 



2 Subsampling 

We first recall the subsampling procedure in [AM07] which approximates a symmetric matrix M G S„ 
using a subset of its coefficients. The entries of M are independently sampled as 

J Mij/p with probability p 
bij = \ otherwise, (1) 

where p G [0, 1] is the sampling probability. Theorem 1.4 in [AM07] shows that when n is large enough 



\M-S\\ 2 <4||M|| 0O VVp> ( 2 ) 
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holds with high probability. In what follows, we will prove a similar bound on || M — S\\2 using incoherence 
conditions on the spectral decomposition of M. 

2.1 Computational benefits 

Computing k leading eigenvectors and eigenvalues of a symmetric matrix of dimension n using iterative 
algorithms such as the power or Lanczos methods (see [GVL90, Chap. 8-9] for example) only requires 
matrix vector products, hence can be performed in 0(kn 2 ) flops when the matrix is dense. However, 
this cost is reduced to 0{k Card(M)) flops for sparse matrices M. Because the matrix S defined in (1) 
has only pn 2 nonzero coefficients on average, the cost of computing k leading eigenvalues/eigenvectors 
of S will typically be 1/p times smaller than that of performing the same task on the full matrix M. Of 
course, sampling the matrix S still requires 0(n 2 ) flops, but can be done in a single pass over the data 
and be fully distributed. In what follows, we will show that, under incoherence conditions, averaging the 
eigenvectors of many independently subsampled matrices produces second order accurate approximations 
of the original spectral decomposition. While the global computational cost of this averaging procedure may 
not be globally lower, it is decomposed into many much smaller computations, and is thus particularly well 
adapted to large clusters of simple, loosely connected machines (Amazon EC2, Hadoop, etc.). 
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Figure 1: Our objective here is to approximate the spectral decomposition problem of size 0{n 2 ) by solving 
many independent problems of much smaller size. 



2.2 Sparse matrix approximations 

Let us write the spectral decomposition of M G S„ as 



M 



i=i 



where Ui G R n for i = 1, . . . , n and A G R" are the eigenvalues of M with Ai > . . . > A n (we assume they 
are all distinct). Let a G [0, l] n , we measure the incoherence of the matrix M as 



n{M,a) = ^2\\i\n ai \ 



u i\\oo 



(3) 



i=l 



Note that this definition is slightly different from that used in [CT09] because we do not seek to reconstruct 
the matrix M exactly, so the tail of the spectrum can be partially neglected in our case. As we will see 
below, the fact that we only seek an approximation also allows us to handle sparse eigenvectors. 
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Let us define a matrix Q E S n with i.i.d. Bernoulli coefficients 

Qij ~ 

We can write 



1 jp with probability p 
otherwise. 



Q 



l — p 
p 



c 



where C is has i.i.d. entries with mean zero and variance one, defined as 

y/ (1 — p)/p with probability p 



a 



- y/p/ (1 — p) otherwise. 
We can now write the sampled matrix S in (1) as 



S = M oQ = M + 



M + E 



(4) 



and we now seek to bound the spectral norm of the residual matrix E as n goes to infinity. Naturally, if 
|| i^ll 2 is small, S is a good approximation of M in spectral terms, because of Weyl's inequality and the 
Davis-Kahan sin (9) -theorem (see [Bha97]). So our aim now is to control \\E\\ 2 so we can guarantee the 
quality of spectral approximations of M made using the sparse matrix S which is computationally easier to 
work with than the dense matrix M. We now make the following key assumptions on the incoherence of the 
matrix M. 



Assumption 1. There is a sequence of vectors a^ n ' G [0, l] n for which 

/i(M,a (n) ) < n and Card(ui) < 
as n goes to infinity, where (i is an absolute constant. 



n 



l,...,n 



In what follows, we will drop the dependence of a on n to make the notation less cumbersome, so 
instead of writing we will just write a. We have the following theorem. 

Theorem 1. Suppose that Assumption 1 holds. Let us call a m i n = mini<j< n a^. Assume that p and n are 
such that, p < 1/2, and for a given 5 > 0, a m j n > (log n)( 5 ~ 3 )/ 4 and 



(a min log n) 4 
pn amin 



, as n — )■ oo, 



then we have 



limsup ||-E|| 2 < 2/x (pn amin ) 



-1/2 



a.s 



(5) 



Proof. Using [HJ91, Th. 5.5.19] or the fact that uu T o C = D U CD U , where D u is a diagonal matrix with 
the vector u on the diagonal (remember that ||-|| 2 is a matrix norm and hence sub-multiplicative), we get 



\E\ 



1 — p 



UiUi 



i=l 



< 



1-p 



i=i 



U: 



(6) 
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Since we assume that the vector m is sparse with Card(uj) < n ai , C ai is a principal submatrix of C with 
dimension n ai . Now, we show in Theorem A-l (this is the key element of the proof - see p. 17) that 



lim sup 

n— >oo 



<2, 



whenever p = o 



, and a m ; n > (log re) ^ 3 ^ 4 for some 5 > 0. (Our proof of Theorem A-l 



relies on a result of Vu [Vu07] and Talagrand's inequality.). This yields Equation (5) and concludes the 
proof. □ 

The proof of the theorem makes clear that the error term coming from the sparsest eigenvector will 
usually dominate all the others in the residual matrix E. 

In these approximation methods, we naturally want to use a small p, so that S is very sparse and the 
computation of its spectral decomposition is numerically cheap. The result of Theorem A-2 guarantees that 
the subsampling approximation works whenever p S> (a m i n log n) /n amln (asymptotically, but we have in 
mind a very high-dimensional setting, so n will be large in practice). 

A natural question is therefore whether we could use p much smaller than this. Separate computations 
(see Subsection A-3) indicate that HC/re 1 ' 2 ]^ goes to infinity if p < (log n) l ~ & jn, which suggests that 
this subsampling approach to approximating eigenproperties of M might run into trouble if the sampling 
rate p gets smaller than log n/n. As a matter of fact, we could not control the quantities ||Cai /n ai ^ 2 || 2 
at this sampling rate, which is naturally problematic given the way we established the bound on ||-E|| 2 - 
Furthermore, if the sparsest eigenvector had support disjoint from the supports of all other eigenvectors, E 
would be the sum of two block diagonal matrices. Hence, its operator norm would be the maximum of the 
operator norms of the two blocks, at least one of which having potentially very large operator norm. 



2.3 Tightness 

Note that, in the limit case a = 1 where the eigenvectors are fully dense and incoherent, our bound is similar 
to the original bound in [AM07, Theorem 1.4] or that of [KMO09, Th 1.1] (our model for M is completely 
different however). In fact, the bounds in (2) and (5) can be directly compared. In the fully dense case where 
a = 1, we have 



y/n\\M\ 



£ 

i=i 



r 



XiUiUi 



< ri 



-1/2 



El* 



n\\Ui 



< n 



-1/2 



so in this limit case, the original bound in (2) is always tighter than our bound in (5). However, in the sparse 
incoherent case where the ratio of the bound (2) in [AM07] over our bound (5) becomes 

Li=i Kn 2 u iUi 



ziLii^K 



U; 



which can be large when a m - m < 1. The results in [KMO09], which are focused on exact recovery of low 
rank incoherent matrices, do not apply when the eigenvectors are sparse (i.e. a / 1). 
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2.4 Approximating eigenvectors 

We now study the impact of subsampling on the eigenvectors and in particular on the one associated with 
the largest eigenvalue. We have the following theorem. 

Theorem 2. Assume that the eigenvalues ofM are simple. Let us call 6 R n and Xk(S) the k-th eigenpair 
of S, and u^ £ R n , A& the k-th eigenpair of M. We write the reduced resolvent of M associated with 
Uk, defined as 



Ri 



1 



and let A^ = Rk{E — (Xk(S) — Xj.)ld). We also call dk the separation distance of A&, i.e the distance 
from Xk to the nearest eigenvalue of M. If\\E\\ 2 satisfies \\E\\ 2 < d^/2, then 



v k -u k + 



£(-l)»>A' 



m=0 



RkEu k 



< - 
~ 2 



1 (2 \\E\\ 



i+2 



1 



1 _ 2 II- E ll2 
- 1 - j 



(7) 



having normalized so v^u^ = 1. 

Proof. From now on we focus on Uk and drop the dependence on k in Uk, Vk, Rk, etc... when this does 
not create confusion. We also use the notation X$ and A instead of Xf.(S) and Xf.. If v is normalized so that 



v u 



1 (so (v — u) u = 0), we have the explicit formula [Kat95, Eq. 3.29] 



v — u 



-(Id + R{E - ~f\<\))- 1 REu 



where 7 = A5 — A. The formula is valid as soon as (Id + R(E — "fid)) is invertible. Let us now call 
A = R(E — jJd) and assume that A has no eigenvalues equal to -1, i.e Id + A is invertible. Then we have 



v — u + 



E(-i)"w 



.m=0 



REu = (-l) J 'A J+1 (Id + A^REu 



(8) 



We also have by construction Ru = 0, so REu = An. Hence, we can write 

j 



u + 



J2(-l) m A r 



m=0 



REu 



-iyA 3+2 {u + A) 



u . 



Now let us call d the separation distance of A. Then ||i?|| 2 = 1/d. Our assumptions guarantee that \\E\\ 2 
is such that 2 ||-E|| 2 /d < 1. We note that using Weyl's inequality, \Xs — A| < \\S — M\\ 2 = ||^|| 2 > hence 
2 \\E\\ 2 jd and 



|A|| < 2||i2|| 2 ||E|| 2 



\(Id + A)-%< 



1 



2IIBII 



Putting all the elements together and recalling that ||u|| 2 = 1, we get (7) from Equation (8). 



□ 



Spectral methods tend to focus on eigenvectors associated with extremal eigenvalues, so let us elaborate 
on the meaning of Theorem 2 for the eigenvector associated with the largest eigenvalue. If we suppose 
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that the spectral norm of the residual matrix E is smaller than half the separation distance of the largest 
eigenvalue, i.e 

\\E\\ 2 < (Ai-A 2 )/2, (9) 

the previous result (and results such as [Kat95, Theorem II.3.9]) shows that we can use perturbation expan- 
sions to approximate the leading eigenvector of the subsampled matrix. Based on the bound in Equation 
(5), the condition stated in Equation (9) will be satisfied (asymptotically with high-probability) if, for some 

e > 0, 

<(Ai-A 2 )/(4 + e). 



y/pn a 

We note that assumption (9) is likely reasonable if one eigenvalue is very large compared to the others, 
which is a natural setting for methods such as PCA. (Note however that our result is not limited to the 
largest eigenvalue but actually applies to any eigenvalue of the original matrix M, A, for which ||-E|| 2 is 
smaller than half the distance from A to any other eigenvalue of M. In particular, the result would apply to 
several separated eigenvalues.) We also note that the approximation 



u 



REu 



_m=0 

is accurate to order j + 2. 

Let us now try to make our approximation slightly more explicit. If we write R the reduced resolvent of 
M (associated with tti), and assume that Ai — A 2 stays bounded away from 0, we have in this setting, using 
Equation (7) with j = 1, 

v = u- REu + R(E - (Ai(5) - Ai) ld)REu + P (\\E\\l) , 

and therefore 

v = u - REu + R{E - u T Eu\A)REu + P {\\E\\l) , (10) 

after we account for the fact that u T Eu is an order- \\E\\^ accurate approximation of \\{S) — \\ [Kat95, 
Eq. 2.36 and 3.18]. This approximation makes clear that a key component in the accuracy of our approx- 
imations will be the size of the vector Eu. For simplicity here, we have normalized v so that v T u = 1; a 
similar result holds if we set v T v = 1 instead, if for instance ||-E7|| 2 — > asymptotically. 



2.5 Second order accuracy result for eigenvectors by averaging 

In light of Equation (10), it is clear that v is a first order accurate approximation of u, because of the 
presence of the (first-order) term REu in the expansion. We now show that we can get a second order 
accurate approximation of the eigenvector u. Our results are based on an averaging procedure and hence are 
easy to implement in a distributed fashion. We have the following second-order accuracy result. 

Theorem 3. Let us call u\ the eigenvector associated with the largest eigenvalue of M, and v\ = v\j || 
the eigenvector associated with the largest eigenvalue of S and normalized so that = 1 and vfu\ > 0. 
Let us call £ = ^/(pn amin ) 1//2 . Suppose that the assumptions of Theorem 1 are satisfied (hence £ —> 0). 
Suppose also that d = (Ai — A 2 ) satisfies 

d > ZVMZ- 2 ) ■ (11) 

Then we have ^ ^ 

e - my = o ( - * 9 ^ ) = o (L\ . 
L " " 2J v(Ai - A 2 ) 2 pn Q mmy yd? j 



i 



Practically, this means that if we average eigenvectors over many subsampled matrices (after removing 
indeterminacy by always making the first component positive), the residual error will be of order 1 1 _E7 1 1 § / cZ 2 
with 

limsup Il-Elll < 4 — — — . 

In other words, by averaging subsampled eigenvectors, we gain an order of accuracy (over the method that 
would just take one subsampled eigenvector) by canceling the effect of the first order residual term REu. 

Proof. To keep notations simple, we drop the index 1 in v and u in the proof (so v\ = v and u\ = u). In 
what follows, k is a generic constant that may change from display to display. Before we start the proof per 
se, let us make a few remarks. 

First, there is a technical difficulty when trying to work directly with v, namely the fact that it appears 
difficult to control E [|| (Id + A) -1 1| ] and hence to get a bound on E[||u — u\\] (with the normalization 
v T u = 1, \\v\\ could be very large; our bounds show that this can happen with only low probability but 
obviously E[||v||] could still be large). To go around this difficulty, we need two steps: first, we work with 
unit eigenvectors (so we go from v to v), and second we need a "regularization" step and will replace v 
by a vector v e which is equal to v with high-probability and for which we can control E[||v 6 — u\\]. More 
precisely, for e > 0, we call v e the vector such that 

\v if 1 1 (Id + A)" 1 !! < \ 

I u — REu + AREu otherwise. 

Its properties are studied in Theorem A-3. We call it below the e-regularized version of v. 

We note that under the assumptions of the current theorem we have | — > 0, so the results of Theorem 
A-3 apply. In particular, as shown in the proof of that Theorem, we have HMH^/p 2 = o (£ 2 )- Also, 
Assumption 1 (which is made in Theorem 1), means u is fixed so £ — > 0, as pn amin — > oo. 

If v is the eigenvector of S associated with its largest eigenvalue, using the fact that (v — u) T u = by 
construction, we have 

IMI2 = W v ~ u \\\ + IMI2 = 1 + ||w — it]! 2 . 

hence 



1 1 11 1- 
1 + ||u — it|| 2 

Turning our attention to v £ , we see that, since Ru = by construction and R is symmetric, u T A = 0, so 
(v e — u) T u = 0, and hence 



l~ II 52 n , 



Now let us call 



1 1 ll~ Il2 
1 + \\V £ — U\ 



12 

we see that (3 = v as long as 1 1 (Id + A) _1 || 2 < l/e, since when this happens, v = v £ . Now we have 

E[||u - u\\ 2 ] = E[\\u - v\\ 2 l v= p\ + E[||u - u\\ 2 l v ^p\ 
<^\u-P\\ 2 \ v= p] + E[||u-i/|| 2 1^] 
<E[\\ u -f3\\ 2 ] + 2P(v^f3), 
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since ||u — u\\ 2 < IMI2 + IMI2 = ^ ( n °te the importance of the change of normalization here, as this bound 
would not hold with v instead of v). Let us now work on controlling both these quantities. For reasons that 
will be clear later, we now take e = 2£/d. 



Control of E[ || m - (3\\ 2 ]. Given that u - (3 = (u-v £ )/\/l + \\u - v £ \\ 2 + u(l - l/yjl + \\u - v e \\ 2 ), 
we have 

u - v £ \\ 2 ,,,,/. 1 



u - 0|| 2 < CIIZ = + ||u|U I 1 



11 n2 I / 11 m2 

1 + ||« — v £ \\ 2 \ yl + \\u — v £ \\ 2 



< \\u - v £ \\ 2 + (y 1 + 1|«- v £ \\l - 1) 

< 2 \\u - v £ \\ 2 , 

since Vl + x 2 < 1 + x for x > 0. Let us call [if (pn a ™ in ) 1 / 2 = £ and d = \\ — A2. We show in Theorem 
A-3 that, for some k > 0, asymptotically 

E[||«-t; e || 2 ]<«(^ + i i ) 

so when s > £/<i, we have E [||u — u £ || 2 ] < and therefore 

B[\\u-P\\ 2 ]<k^. 

Control of P(y 7^ (3). We have (essentially) seen in the proof of Theorem 2 above that if 2 ||_E|| 2 /d < 
1 — e, then ||(Id + A) _1 || 2 < 1/e (see also the proof of Theorem A-3). Hence 



P (||(Id + > 1/e) < P (\\E\\ 2 > 



Recall that we have now chosen e = 2£/d. In that case, we have 

2 2 5 ' 

Now we show the following deviation inequality in Theorem A-2: if the is a median of ||-E|| 2 > 

P(|p|| 2 -m B |>t)<4exp( i^rt 2 ) . 



8\\M\ 



00 



Recall also that for n large enough < nig < 3£ when the conditions of Theorem 1 apply (see Theorems 
1 or arguments at the end of the proof of Theorem A-l). Suppose now that n is such that indeed hie < 3£. 
Then if f - 4£ > 0, we have 

p (W > <4^) < p (\im, - -«i > ^ - m E ) < p (\m, - >i-M). 
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Now when £/d — > 0, | — 4£ > | asymptotically. Since we assumed that d > £-y/ln(£~ 2 ) and £ — )• 0, we 
indeed have £/d — > 0. Therefore, 



CO 



All we have to do now is to verify that the asymptotics we consider, the quantity on the right-hand side of the 
previous equation remains less than £ 2 /d 2 asymptotically. Elementary algebra shows that this is equivalent 
to saying that 

II Mil 2 II Mil 2 

d 2 - 72^^ \n(d 2 ) > 72U°o(-ln(£ 2 ) + ln4) . (12 ) 

pZ pZ 

We have HMH^/p 2 = o (C 2 )> so the right-hand side is going to zero. In particular, we see that when 
d > £ -\/ln(£; _2 ), as we assume, the inequality above is satisfied asymptotically. As a matter of fact, when 
d < exp(l), 

1 1 2 

d 2 - 72^|^ ln(d 2 ) > d 2 , 
pZ 

and the result comes out of the fact that ^ M \°° = o (£ 2 )- If d > exp(l), the result is obvious as the right- 
hand side of Equation (12) goes to asymptotically, while the left-hand side is asymptotically larger than 
exp(2)/2 for instance. So we have shown that under our assumptions, 



We can finally conclude that 



E[||i/-u|| 2 ] < K-p 



as announced in the theorem. □ 

This result applies to all eigenvectors corresponding to eigenvalues whose isolation distance (i.e distance 
to the nearest eigenvalue) satisfies the separation condition (11), which is a strong version of the separation 
condition (9). We note that we need the strong separation condition (Equation (11)) to be able to take 
expectations rigorously. 

Finally, we note that theoretical as well as practical considerations seem to indicate that condition (9) 
(and hence (11)) is quite conservative. On the theoretical side, we see with Equation (8) that what really 
matters for the quality of the approximation is the norm of the vector 

lj = A J+2 (Id + A)" 1 ^ , 

or its expectation. We used in our approximations the coarse bound ||A|| 2 < 2 ||i2|| 2 ||2£|| 2 , which is con- 
venient because it does not require us to have information about the eigenvectors of A. However, we see 
that the norm of lj could be small even when ||i?|| 2 \\E\\ 2 is not very small, for instance if u belonged to a 
subspace spanned by eigenvectors of A associated with eigenvalues of this matrix that are small in absolute 
value. So it is quite possible that our method could work in a somewhat larger range of situations than the 
one for which we have theoretical guarantees. This is what our simulations below seem to indicate. 
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2.6 Variance 



The expansion in Equation (10) also allows us to approximate the variance of the first-order residual REu 
after subsampling. This is useful in practice because it gives us an idea of how many independent computa- 
tions we need to make to essentially void the effect of the first order term in the expansion of v. In terms of 
distributed computing, it therefore tells us how many machines we should involve in the computation. We 
have the following theorem. 

Theorem 4. Let u\ be the eigenvector associated with \\, the largest eigenvalue of M. Let us call w\ = 
u\ o u\, and A4 = M o M. Then 



E[\\RE Ul \\ 2 2 ] < 



1 



1 — p 



(A 2 -A!) 2 p 



\k=l 



2w^Mw 1 - w\(k)M k k 



fc=i 



Assuming w.l.o.g. that Ai = ||M||2, this bound yields in particular 



E[\\RE Ul \\l] < 



1 



;i-a 2 /ai 



2 NumRank(M) 



V 



(13) 



where NumRank(M) = ||M||^/||M||2 is the numerical rank of the matrix M and is a stable relaxation 
of the rank, satisfying 1 < NumRank(M) < Rank(M) < n (see [RV07] for a discussion). 



Proof. By construction, E[_E] =0 and 



B[\\RE Ul \\l} = E[ujER 2 Eu 1 ] = J2~E[ 

J=2 



by definition of R. Now 



E, np \ o II 1 1 2 nr 2 
(n 1 Eiij) = \\Eui\\ 2 = u 1 E u\ , 



because E is symmetric, the Uj's form an orthonormal basis and ujEuj is the j-th coefficient of Eu\ in this 
basis, so the sum of the squared coefficients is the squared norm of the vector. Hence 

1 



E REu 



1 II 2. 



< 



(A 2 - Xif 



E[uJ E 2 m] - var(u{Eui)) 



The variance of u\Eu\ is easy to compute if we rewrite this quantity as a sum of independent random 
variables. Also, separate computations (see Appendix, Subsection A-4) show that E[i? 2 ] is a diagonal 
matrix, whose i-th diagonal entry is (1 — p)||Mj|||/p, where Mi is the i-th column of M. Hence, in that 
case, having defined w\ = u\ o u\ and M. = M o M, we get 



E[\\RE Ul \\ 2 2 ] < 



1 — p 



(A 2 -Ai) 2 p 



Y Ul (k) 2 \\M k \\ 2 



\k=l 



2wlMw! -^2wl(k)M 



kk 



k=l 



Assuming w.l.o.g. that Ai = ||M||2, we get (13). 



□ 
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2.7 Nonsymmetric matrices 

The results described above are easily extended to nonsymmetric matrices. Here M G R mxn , with m > n 
and we write its spectral decomposition 

n 

M = y^ajUjvf, 

i=l 

where ui G R n , V{ G R m and cr, > 0. We can adapt the definition of incoherence to 

n 

n(M,a,p) =5^o- i n a</2 ||«i|| 00 m /?i/2 ||i;i||oo 

i=l 

and reformulate our main assumption on M as follows. 

Assumption 2. There are vectors a G [0, l] n and (3 G [0, \] n for which 

fj,(M, a, 0) < fj, and Card(uj) < n a \ Card(wj) < m ft , i = l,...,n 
as m,n go to infinity with m = pnfor a given p > 1, where fi is an absolute constant. 
In this setting, using again [HJ91, Th. 5.5.19], we get 

n 

yi o-iC o (uivj) 

i=l 

where we have assumed that itj, Uj are sparse and C ah ^ is a n Qi x m ft submatrix of C. As in (5), we can 
then bound the spectral norm of the residual and we have 

limsup ||£J|| 2 < = (15) 

n — ^OO / a min ^min 

\/pn 2 m 2 

almost surely. Perturbation results similar to (10) for left and right eigenvectors are detailed in [Ste98] for 
example. 

3 Numerical experiments 

In this section, we study the numerical performance of the subsampling/averaging results detailed above on 
both artificial and realistic data matrices 

Dense matrices: PCA, SVD, etc. We first illustrate our results by approximating the leading eigenvector 
of a matrix M as the average of leading eigenvectors of subsampled matrices, for various values of the 
sampling probability p. To start with a naturally structured dense matrix, we form M as the covariance 
matrix of the 500 most active genes in the colon cancer data set in [ABN + 99]. We let p vary from 10" 4 to 
1 and for each p, we compute the leading eigenvector of 1000 subsampled matrices, average these vectors 
and normalize the result. We call u the true leading eigenvector of M and v the approximate one. We 
now normalize v so that ||u|| 2 = 1 (which is standard, but different from the normalization we used in our 
theoretical investigations where we had u T v = 1). 



=i 



771^/4 Ik. II 
I OO ' ' 1 1 1 u l 1 1 oo 



a 



n a i/i m ^/i 



(14) 
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Figure 2: Left: Alignment u v between the true and the normalized average of 1000 subsampled eigenvec- 
tors (blue circles), median value of u T v over all sampled matrices (solid black line), with dotted lines at 
plus and minus one standard deviation and proportion of samples satisfying the perturbation condition (9) 
(dashed red line), for various values of the sampling probability p on a gene expression covariance matrix. 
Right: Zoom on the the interval p G [10~ 2 , 1]. 



In Figure 2, we plot u T v as a function of p together with the median of u T v sampled over all individ- 
ual subsampled matrices, with dotted lines at plus and minus one standard deviation. We also record the 
proportion of samples where \\E\\ satisfies the perturbation condition (9). 

We repeat this experiment on a (nonsymmetric) term-document matrix formed using press release data 
from PRnewswire, to test the impact of subsampling on Latent Semantic Indexing results. Once again, we 
let p vary from 10~ 2 to 1 and for each p, we compute the leading eigenvector of 1000 subsampled matrices, 
average these vectors and normalize the result. We call u the true leading eigenvector of M and v the 
approximate one. In Figure 3 on the left, we plot u T v as a function of p together with the median of u T v 
sampled over all individual subsampled matrices, with dotted lines at plus and minus one standard deviation. 
The matrix M is 6779 x 11171 with spectral gap 02/(71 = 0.66. 

In Figure 3 on the right, we plot the ratio of CPU time for subsampling a gene expression matrix of 
dimension 2000 and computing the leading eigenvector of the subsampled matrix (on a single machine), over 
CPU time for computing the leading eigenvector of the original matrix. Two regimes appeal - , one where the 
eigenvalue computation dominates with computation cost scaling with p, another where the sampling cost 
dominates and the speedup is simply the ratio between sampling time and the CPU cost of a full eigenvector 
computation. Of course, the principal computational benefit of subsampling is the fact that memory usage 
is directly proportional to p. 

A key difference between the experiments of Figure 2 and those of 3 is that the leading eigenvector of 
the gene expression data set is much more incoherent than the leading left eigenvector of the term-document 
matrix, which explains part of the difference in performance. We compare both eigenvectors in Figure 4. 

We then study the impact of the number of samples on precision. We use again the colon cancer data set 
in [ABN+99]. In Figure 5 on the left, we fix the sampling rate at p = 10~ 2 and plot u T v as a function of the 
number of samples used in averaging. We also measure the impact of the eigenvalue gap A2/ Ai on precision. 
We scale the spectrum of the gene expression covariance matrix so that its first eigenvalue is Ai = 1 and 
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Figure 3: Left: Alignment u v between the true and the normalized average of 1000 subsampled left 
eigenvectors (blue circles), median value (solid black line), dotted lines at plus and minus one standard 
deviation and proportion of samples satisfying condition (9) (dashed red line), for various values of the 
sampling probability p on a term document matrix with dimensions 6779 x 11171. Right: Speedup in 
computing leading eigenvectors on gene expression data, for various values of the sampling probability p. 




Figure 4: Magnitude of eigenvector coefficients \ui \ in decreasing order for both the leading eigenvector of 
the gene expression covariance matrix (left) and leading left eigenvector of the 6779 x 11171 term document 
matrix (right). 



plot the alignment u T v between the true and the normalized average of 100 subsampled eigenvectors over 
subsampling probabilities p £ [10 -2 , 1] for various values of the spectral gap A2/A1 G {0.75, 0.95, 0.99}. 

Graph matrices: ranking. Here, we test the performance of the methods described above on graph matri- 
ces used in ranking algorithms such as pagerank [PBMW98] (because of its susceptibility to manipulations 
however, this is only one of many features used by search engines). Suppose we are given the adjacency 
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Figure 5 : Left: Alignment u T v between the true leading eigenvector u and the normalized average leading 
eigenvector versus number of samples, on the gene expression covariance matrix with subsampling proba- 
bility p = 1CP 2 . Right: Alignment u T v for various values of the spectral gap A2/A1 £ {0.75, 0.95, 0.99}. 




2000 4000 6000 8000 10° 10 2 10 4 1 0* 



nz = 36854 

i 

Figure 6: Left: The wb-cs . Stanford graph. Right: Loglog plot of the Pagerank vector coefficients for 
the cnr-20 00 graph. 

matrix of a web graph, with 

J Aij = 1 , if there is a link from i to j 
\ Aij = 0, otherwise, 

where A G R nxn ( ne such matrix is displayed in Figure 6). Whenever a node has no out-links, we link it 
with every other node in the graph, so that B = A + 51 T /n, with 5i = 1 if and only if degj = 0, where degj 
is the degree of node i. We then normalize B into a stochastic matrix Pf- = Bij/deg^ The matrix P 9 is the 
transition matrix of a Markov chain on the graph modeling the behavior of a web surfer randomly clicking 
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Figure 7: Ranking correlation (Spearman's p) between true and averaged pagerank vector (blue circles), 
median value of the correlation over all subsampled matrices (solid black line), dotted lines at plus and 
minus one standard deviation and proportion of samples satisfying the perturbation condition (9) (dashed 
red line), for various values of the sampling probability p. Left: On the wb-cs . Stanford graph. Right: 
On the cnr-2 graph. 



on links at every page. For most web graphs, this Markov chain is usually not irreducible but if we set 

p = C P 9 + (1 - c)ll T /n 

for some c E (0, 1], the Markov chain with transition matrix P will be irreducible. An additional benefit of 
this modification is that the spectral gap of P is at least c [HK03]. The leading (Perron-Frobenius) eigen- 
vector u of this matrix is called the Pagerank vector [PBMW98], its coefficients Ui measure the stationary 
probability of page i being visited by a random surfer driven by the transition matrix P, hence reflect the 
importance of page i according to this model. 

The coefficients of pagerank vectors typically follow a power law for classic values of the damping 
factor [PRU06,BC06] which means that the bounds in assumption 1 do not hold. Empirically however, while 
the distance between true and averaged eigenvectors quickly gets large, the ranking correlation (measured 
using Spearman's p [Mel07]) is surprisingly robust to subsampling. 

We use two graphs from the Webgraph database [BV04], wb-cs . Stanford which has 9914 nodes 
and 36854 edges, and cnr-2 00 which has 325,557 nodes and 3,216,152 edges. For each graph, we 
form the transition matrix P as in [GZB04] with uniform teleportation probability and set the teleportation 
coefficient c = 0.85. In Figure 6 we plot the wb-cs . Stanford graph and the Pagerank vector for 
cnr-20 00 in loglog scale. In Figure 7 we plot the ranking correlation (Spearman's p) between true and 
averaged Pagerank vector (over 1000 samples), the median value of the correlation over all subsampled 
matrices and the proportion of samples satisfying the perturbation condition (9), for various values of the 
sampling probability p. We notice that averaging very significantly improves ranking correlation, far outside 
the perturbation regime. 
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4 Conclusion 



We have proposed a method to compute the eigenvectors of very large matrices in a distributed fashion: 

1. To each node in a computer cluster of size N, we send a subsampled version Si of the matrix of 
interest, M. 

2. Node % computes the relevant eigenvectors of Si. 

3. The N eigenvectors are averaged together and normalized to produce our final estimator. 

The key to the algorithm is that Step 2 is numerically cheap (because Si is very sparse), and hence can 
be executed fast even on small machines. Therefore a cluster or cloud of small machines could be used to 
approximate the eigenvectors of M, a difficult problem in general when M is extremely large. 

We have shown that under carefully stated conditions, the algorithm described above will yield a second- 
order accurate approximation of the eigenvectors of M. This gain in accuracy comes from the averaging 
step of our algorithm. We note that arguments similar to the ones we used in this paper could be made 
to compute second-order accurate approximations of the eigenvalues of M. (We restricted ourselves to 
eigenvectors here because in methods such as PCA, the eigenvectors are in some sense more important than 
the eigenvalues.) Our results depend on a measure of incoherence for M that we propose in this paper. 
They also show that subsampling will work if the sampling probability is small, but is likely to fail if that 
probability is too small. 

Finally, our simulations show that we gain significantly in accuracy by averaging subsampled eigenvec- 
tors (which suggests that our theoretical passage from first-order to second-order accuracy is also relevant 
in practice) and that the performance of our method seems to degrade for very incoherent matrices, a result 
that is also in line with our theoretical predictions. 



A Appendix 

A-l On||C|| 2 

Let us consider the symmetric random matrix C with entries distributed as, for i > j, 

with probability p 
with probability 1 — p 



a 




(A-l) 



We assume that C is n x n. Our aim is to show that we can control ||C|| 2 and in particular its deviation 
around its median. We do so by using Talagrand's inequality. 
We have the following theorem. 

Theorem A-l. Suppose that we observe n matrices C ai ,for 1 < % < n with entries distributed as those of 
the matrix C just described. Suppose these matrices are of size n ai , where on are positive numbers. Call 
c^min = mini<i< n aj and assume that, for some fixed 5 > 0, a m j n > (log n)^ -3 ^ 4 . Suppose further that p 
is such that lim n _ i , 00 (a m i n log n) / (n Qmin p) = 0. Then 



lim sup 



a, 



n 



ai/2 



< 2a.s 



(A-2) 
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Proof. We note that the application C — > 1 1 C 1 1 2 is a convex, \/2-Lipschitz (with respect to Euclidian/Frobenius 
norm) function of the entries of C that are on or above the main diagonal. As a matter of fact, since || • || is 
a norm, it is convex. Furthermore, if A and B are two symmetric matrices, 



\A-B\\ 2 < \\A-B\\ F 



I.J 



Now recall the consequence of Talagrand's inequality [Tal95] spelled out in [LedOl], Corollary 4.10 and 
Equation (4.10): if F is a convex, 1-Lipschitz function (with respect to Euclidian norm) on M n , of n inde- 
pendent random variables (Xi, . . . , X n ) that take value in [u, v], and if mp is a median of F{X\, . . . , X n ), 
then 

P(\F - m F \ >t)< 4exp(-t 2 /[4(w - v) 2 }) . (A-3) 



The random variables that are above the main diagonal of C sue bounded, and take value in 
We note that 

o 

1 



v kzE] 
i-pi V P 1 



1 — p 



+ 



V 



p y 1 — p J p(l — p) 
Therefore, calling m n the median of 1 1 1 / 2 C 1 1 2 , we have, in light of Equation (A-3), 



P 



C 



n 



1/2 



m r 



>t)< 4exp 



nV 



8/(p(l-p)) 



4exp 



t- 



■p(l — p)n 



(A-4) 



Suppose now that we have a collection C ai of matrices of size n ai with entries distributed as in Equation 
(A-l). (Note that the matrices could be dependent.) Let us call m n ^i the medians of ||C' Q!i /n ai / 2 || 2 . Then 
we have, by a simple union bound argument, for any k, 



P ( max 

,l<i<fc 



11 



Qi/2 



m n ai 



> tj < 4^exp (^-^p(l -p)n a ^ < 4/cexp - p)n Qmin ^ 



where a min = mini<j< fc oti. 

Suppose now that k = n, p < 1/2, pn amin > (logn) 1+<5 , and t > (logn)" 5 / 3 for some 5 > 0. Then, 
t 2 p(l — p)n Qmin > (log n) 1+<5 / 3 /2, which tends to 00 as n — > 00. Because u n = nexp(— (log n) 1+<5 / 3 /16) 
is the general term of a converging series, we have, when p < 1/2 and pn amin > (log n) 1+s for some 5 > 0, 



max 

Ki<n 



n 



o.i/2 



m n °<i 



< (logn) 5/3 a.s , 



by a simple application of the Borel-Cantelli lemma. Hence, we have 

Cry, 



max 

Ki<n 



n 



at/2 



< max m n ai + (logn) a.s 



Ki<n 



(A-5) 



Now all we have to do is control maxi<j<„ m n ai, which is the maximum of a deterministic sequence. 

Recall Vu's Theorem 1.4 in [Vu07], applied to our situation where we are dealing with bounded random 
variables with mean and variance 1 : if the matrix C has entries as above and is n x n, then almost surely, 



C 



n 



1/2 



< 2 + K 



1 — p 



1/4 



n l / A log(n) , 
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for some constant kq. So as soon as (log n) 4 /(pn) remains bounded, so does m n , the median of 
In particular, if (log n) 4 / (pn) — > 0, we have 

limsupm n < 2 . 



Using elementary properties of the function / such that f(t) = (log t) 4 /t, we can therefore conclude that if 
a m in is such that 



(a min log n) 



n 



0, 



we have 



limsup max m n «i < 2 . 
n— >oo l<i<n 



(Note that this is true because we are taking the maximum of elements of a fixed deterministic sequence that 
is asymptotically less than or equal to 2 + e, for any e and the smallest argument is going to infinity. All the 
work using Talagrand's inequality was done to allow us to switch from having to control the maximum of a 
random sequence to that of a deterministic sequence.) 

Now when (a m i n logn) 4 /(pn Qmin ) — > 0, we have a fortiori pn" min > (logn) 1+<5 when a m i n > 
(logra)^ -3 )/ 4 . So we conclude that when (a m ,i n logn) 4 /(pn Qmin ) — > and a m ; n > (log n)^ -3 )/ 4 , 



limsup max 

n->oo l<i<n 



c n 



n 



cti/2 



< 2 a.s . 



□ 



Let us now consider the related issue of understanding the matrix E = r p MoC, where r p = ^/(l — p) /p, 
M is a deterministic matrix and C is a random matrix as above. 

Theorem A-2. Suppose E = r p M o C, where C is a symmetric random matrix distributed as above, M is 
a deterministic matrix and r p = y/ (1 — p)/p. Let us call the a median of\\E\\ 2 . Then we have 



P(\ \\E\\ 2 - m E \ >t)< 4exp 



I M I 



Hence, in particular, 



and 



E 



I -El 



< mi + 32-! 



I Ml 



P 



+ 8m E 



E[\\E\\l] < 4m| + 120F f 8|l y io ° 




(A-6) 



(A-7) 



Proof. The crux of the proof is quite similar to that of Theorem A- 1 : we will rely on Talagrand's concentra- 
tion inequality for convex 1-Lipschitz functions of bounded random variables. To do so let us consider the 
map: C — > f(C) = \\M o C\\ 2 . This map / is convex as the composition of a norm with an affine mapping. 
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Let us now show that it is (y/2 ||M|| 00 )-Lipschitz with respect to Euclidian norm: if we denote by c\ k j the 
(i, j)-th entry of the matrix Ck, we have 

|/(Ci)-/(C 2 )| = |||Mo Ci|| 2 -||MoC 2 || 2 | < ||Mo(Ci-(7 2 )|| 2 



<\\ M o{ Cl -c 2 )\\ F = jY. M lM3- c '. 



2 (JX) _ J&\2 
1,3 



< maxl^l JJ2(c% -c%) 2 < UMIU^ /g(pg 



Hence, / is indeed a (y/2 ||M|| 00 )-Lipschitz function of the entries of C that are above or on the diagonal. 
Now the function of C we care about is <?(•) = r p f(-), which is convex and v2 || M r p - Lipschitz. Given 
that the entries of C are bounded, we have, as in the proof of Theorem A-l, 



P(| n - mE \ > t) < 4exp {-^rf) = 4exp 



2 



Now using the proof of Proposition 1.9 in [LedOl] (see p. 12 of this book), we conclude that 



'27r||M| |2 



E [| ||£|| 2 - m E \\ < 4^ — > and 

II Mil 2 

E[|||£|| 2 -m£| 2 ] <32^ . 

Therefore, 

|2 



E 



1^112 



< m | + 32H£ +8mB ./ 2 -iwi 2 oc 



9 -C* \ / 9 5 



since for a and b positive, a 2 < b 2 + (a — b) 2 + 2b\a — b\. 

More generally, we see, using essentially Proposition 1.10 in [LedOl] and elementary properties of the 
Gamma function, that if the random variable F is such that for a deterministic number ap, P(\F — ap\ > 
t) <C exp(— cr 2 ), then 

E[|F-OF| fc ] < CT(^ + l)cT fc / 2 . 
Applying this result with k = 3, we get 

E \\ \\E\L -m E \ 3 } < 



, ,,9 \ 3 /2 

MM 

I lloo I 



p- 



In our context, using the fact that, for positive a and b, (a + b) 3 < 4(a 3 + 6 3 ) by convexity, we also have 



i2 

3 1 ^ A I ™3 i o f— I ° 11 1V1 lloo 



E[||£|$<4 m| + 3^ 



8||M 



3/2 N 



pi 



□ 
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A-2 Regularized eigenvector considerations 



We now have the following (regularized) second order accuracy result, which is a critical component of the 
proof of Theorem 3, one of the main results of the paper. 

Theorem A-3. Suppose that the assumptions of Theorem 1 are satisfied. We consider the approximation ofu 
the eigenvector associated with the largest eigenvalue of M. Recall that v is the eigenvector corresponding 
to the leading eigenvalue of the subsampled matrix S. For e > 0, we call v £ the vector such that 

%= (vif\\(Id + A)-%<l 

I u — REu + AREu otherwise 

Then, for any rj > 0, we have asymptotically, 

11 1 Vel112 ~ (Ai - A 2 ) 2 pn"™* e(Ai - A 2 ) 3 \pn<*™n J ' 

Suppose further that we are in an asymptotic setting where x 1 ~x 2 (pn^min) 1 /^ ~~ ^ ^- Then, v — v e = with 
high-probability. 

Proof. Let us first show that our regularization does not change the vector we are dealing with with high- 
probability. v £ = v as long as ||(Id + A)" 1 ]^ < 1/e, which is guaranteed if 2 ||-E|| 2 /d < 1 — e. Since 
we assume that x 1 -x 2 ^Iji/; ~~ * m & we have according to Theorem A-2 ||J5|| 2 < 2 ^ ra a n ^ in ^i/ 2 with 
high-probability, we conclude that with high-probability, v £ = v. 

Using Equation (8) with j = 1, we see that, since ||A|| 2 < 2 ||-R|| 2 ||-E?|| 2 , 

1 1 3 1 1 1 1 3 

\\v e -(u- REu + AREu)\\ 2 < ^ \\A\\ 2 2 \\RE\\ 2 < ^&JS _ 

Recall that by construction E[E] = 0. Hence, since R is a fixed deterministic matrix and u is a deterministic 
vector, 

E [v e - u] = E [fi e - u + i?£u] . 
So, if we now use the fact that ||u|| = 1, we have 

||E [fi £ - u]|| 2 = ||E [fi £ - u + || 2 

< ||E[u e -u + ilB«- AAEit]|| 2 + ||E[AR£7u]|| 2 

< E [||fi e - u + REu - AREu\\ 2 ] + E [||A.R.Eit|| 2 ] 



< E 



iHlHl + 2\\R\\l\\E\\l 



Let us now show that we can control the right-hand side of the previous equation. 
We prove in Theorem A-2 that 



E <ml + 32 Mi + 9m J 2 -^L , 

l p A y p z 
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where vtle is a median of the random variable \\E\\ 2 . Our asymptotic control of ||i?|| 2 in (5) gives allows us 
to control uie, namely, 

2 a V 2 

limsupm^ < 4 



pn 



"min 



In other respects, we clearly have HM]^ < Ya=i I \ u i\ loo' an( * hence 



Hence, 



IMI 



< 



P 



since we are in a setting where pn amin — > oo. Similarly, 7Ue\/ ^ M |°° = o ( DW a min ) , so we have for 77 > 0, 



pn 



2 H^lla E 



\E\ 



< 



(Ai — A2) 2 pn Qm ™ 



asymptotically. 

Furthermore, we prove in Theorem A-2 that 



E[\\E\\l] <4m| + 
Hence, for 7/ > 0, 



12^( - 



IMI 



3/2 



■ pi 



4||i2|||E[p||i] < 



< 4mi + o 



^■'mii: 



3^ 



16 + n 
(Ai - A 2 ) 3 



3/2 



□ 



A-3 On ||C|| 2 whenp (logn)/n 

At the end of Subsection 2.2, we mentioned a corollary (see below) of the following theorem: 

Theorem A-4. Suppose that p = (log n) 1-<5 -u n /n, for a fixed 5 in (0, 1) and for a fixed k, < u n < k. 
Suppose further that we can find v n > such that v n — > 00, while v n = o(log n, [u~ ^logn) 15 ] 1 / 4 ). Then 

\\C/s/n\\2 — > 00 with probability one. 

Recall that practically, this theorem suggests that if we don't sample enough the matrix M (i.e p is too 
small), a subsampling approximation to its eigenproperties is not likely to work. Let us now prove it. 

Proof. Our strategy is to show that the largest diagonal entry of C T C/n goes to infinity. To do so, we will 
rely on results in random graph theory. Let us examine more closely this diagonal. Using the definition of 
C, we see that, if T = C T C, and di is the number of times ^/(l — p)/p appears in the 2-th column of C, 

T \h V = + d i • 

1 — p \ p 1 — p J 
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Now {di} is the degree sequence of an Erdos-Renyi random graph. According to [BolOl], Theorem 3.1, if 
k is such that n( n ~ 1 )p k (l — p^ n ~ 1 ~ k — y oo, then, if is the number of vertices with degree greater than 

k, 

lim P(X k >t) = l, 

n— >oo 

for any t. So if we can exhibit such a k, then max di > k with probability going to 1. We now note that for 
small p, 

1 ~P P \ > 1 



p 1 — p J 2p 
Hence, if our k is also such that k/pn — > oo, we will indeed have 

T(i,i) 
max > oo 

i n 

and the theorem will be proved. 

We propose to take k = np(l + v n ). According to [BolOl], Theorem 1.5, if h = k — np, and q = 1 — p, 

n V fc (l P) n ~ k > 1 exp ( h * h * HA h p\ (A 8) 

pj ~ yj2wpqn \ 2pqn 2q 2 n 2 3p 3 n 3 pn J 

where j3 = 1/ (12/c) + l/(12(n — k)). In our case, h = npv n . Let us show that all the terms in the 
exponential are negligible compared to log n as n — > oo: 

• P — > because k — > oo and npv n = o ((log re) 2 "" 5 ) , given that v n = o (log n). Hence n — k — > oo. 

• h/{pn) = v n = o (logn) by assumption. 

• h 4 /(pn) 3 = npv\ = o (ii n (log n) 1_<5 (log n) s /u n ) = o (logn), since v n = o ((it" 1 (logn) 5 ) 1 / 4 ). 

• h 3 /n 2 = npVnP 2 = o (upv^p 2 ) = o (p 2 logn), since v 3 = o (f 4 ) (t» n — > oo by assumption). 

• h 2 /np = npv\ = o (npyf/j = o (logn). 

In light of these estimates, we have as n — > oo, 

h 2 h 3 h A h 

«exp — ^-j - —3-3 ^ -»• 00 . 

zpgn Lq A n* Sp^n pn 



Therefore, with this choice of k, 



We can finally conclude that 



n ( n p % 



k 

maxT(j,i)/n > with probability going to 1 . 

i 2np 

But because v n — > 00, we have kj (2np) — > 00 and the theorem is proved. □ 

We have the following corollary to which we appealed in Subsection 2.2. 

Corollary A-5. When p ~ (log n) 1_<5 jnfor some fixed 5 G (0, 1), 

||C/\/n||2 — > 00 with probability one. 

The previous corollary follows immediately from Theorem A-4, by noticing that u n is lower bounded 
under our assumptions and by taking v n = (log n)" 5 / 5 . 
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A-4 Variance computations 

We provide some details here to complement the explanations we gave in the proof of Theorem 4 in Sub- 
section 2.6. 

On E[i? 2 ] Let us explain why this matrix is diagonal and compute the coefficients on the diagonal. Recall 
that E = — p)/pM o C, where C is a random matrix whose above-diagonal elements are independent, 
have mean and variance 1. Sis naturally symmetric and we call E{ its i-th column. Naturally, E 2 (i,j) = 
Ej Ej. Suppose first that i ^ j. The elements of Ei and Ej are independent, except for Eij and Eji, which 
are equal. In particular, E k i and Ekj are independent for all 1 < k < n. Recall also that E[C] = 0, so 
E[E] = 0. Combining all these elements, we conclude that, if i ^ j, 

n n 

B[EfEj} = ]T E[E ki E kj ] = £ E[iy E[E kj \ = . 

k=l k=l 

Therefore EfE 12 ] is diagonal. Let us now turn our attention to computing the elements of the diagonal. This 
is simple since 

n n 

HEf Ei ] = —-^j2 M ki~ E [ E li\ ~E^~ mwl ■ 

k=l k=l 

We note that this is the result we announced in the proof of Theorem 4 in Subsection 2.6. 

On var(u T Eu) Rewriting this quantity as a sum of independent quantities greatly simplifies the compu- 
tation. If we pursue this route, we have 

u T Eu = ^ u(i)u(j)Ejj = 2 J ^2u(i)u(j)E ij + '^2u(i) 2 E ii . 

i,j i>j i 

Because the previous expression is a sum of independent random variables, we immediately conclude that 

jP- v a r(u T Eu) = 4 ]T u(z) 2 u(j) 2 M* + £ uiifM 2 

P i>j i 

= 2(2 £ u(i) 2 u(j) 2 M 2 + £ n(,) 4 M 2 ) - £ n(i) 4 M 2 . 

i>j i i 

Calling w = u o u and M. = M o M, we immediately recognize in the last expression the quantity 

2(w T Mw)-^w(k) 2 M kk , 

k 

as announced in the proof of Theorem 4. 
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